Consider class sizes at Macalester. Here are some data:
classes = fetchData("courses.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/courses.csv
names(classes)
## [1] "sessionID" "dept" "level"
## [4] "sem" "enroll" "iid"
densityplot(~enroll, data=classes)
mean(~enroll, data=classes)
## [1] 21.17
These data are from the college's point of view: each row is one class. They are accurate, with one caveat: all classes with fewer than 10 students were dropped, because the data were collected for the purpose of studying grades.
The distribution is right-skewed, so the mean is bigger than the median.
median(~enroll, data=classes)
## [1] 18
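Maybe these data are log-normal. Plot the density of log(enroll), then transform the mean of the logs back to the original scale (this is the geometric mean, which for log-normal data estimates the median):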
densityplot(~log(enroll), data=classes)
exp(mean(~log(enroll), data=classes))
## [1] 19.16
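The back-transformed mean of the logs computed above is the geometric mean. Here is a minimal sketch in base R of how it compares with the ordinary mean and the median. The enrollments are made up for illustration and are not the courses.csv data, and the variable names are hypothetical.

```r
# Hypothetical enrollments, invented for illustration (not the courses.csv data)
enroll <- c(10, 12, 15, 18, 20, 25, 40, 80)

arith_mean <- mean(enroll)            # pulled upward by the few large classes
geo_mean   <- exp(mean(log(enroll)))  # mean on the log scale, transformed back
med        <- median(enroll)

# Show the three summaries side by side
c(arithmetic = arith_mean, geometric = geo_mean, median = med)
```

For right-skewed data like these, the geometric mean is less affected by the large classes than the arithmetic mean is, and it can never exceed the arithmetic mean.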
We can argue about whether the mean or median provides the better description of the typical class size. But it's more important to think about why one is interested in this at all.
Example question: what fraction of classes have 35 or more students?
tally( ~ enroll>=35, data=classes, format="proportion")
##
## TRUE FALSE Total
## 0.08324 0.91676 1.00000
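The proportion in the tally above is just the mean of a TRUE/FALSE variable. Here is a minimal base R sketch of that idea, using made-up enrollments rather than the courses.csv data (the names `big` and `prop_big` are hypothetical):

```r
# Hypothetical enrollments, invented for illustration (not the courses.csv data)
enroll <- c(10, 12, 15, 18, 20, 25, 40, 80)

big      <- enroll >= 35   # TRUE for classes with 35 or more students
prop_big <- mean(big)      # TRUE counts as 1 and FALSE as 0, so this is a proportion

# The same proportions laid out as a table
prop.table(table(big))
```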
Suppose we transform the data to the student's point of view. A class of size 35 has 35 students, each of whom experiences a class of size 35, so the number 35 should appear 35 times. Similarly, for a class of size 10, the number 10 should appear 10 times, once for each of the 10 students.
This statement will do that (but you don't need to know a statement like this):
students = with(classes, rep(enroll, times=enroll))
mean(~students)
## [1] 27.13
median(~students)
## [1] 22
tally( ~ students>=35, format="proportion")
##
## TRUE FALSE Total
## 0.1971 0.8029 1.0000
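The replication step can be checked by hand on a tiny example. The sketch below uses three made-up class sizes (not the courses.csv data) and hypothetical variable names. It shows that the student's-view mean is a size-weighted mean of the class sizes, which is why it comes out larger than the college's-view mean.

```r
# Hypothetical class sizes, invented for illustration
enroll <- c(10, 20, 30)

# One entry per student: each class size repeated once for each of its students
students <- rep(enroll, times = enroll)

class_view   <- mean(enroll)     # average over classes
student_view <- mean(students)   # average over students

# The student's-view mean weights each class size by its enrollment
weighted <- sum(enroll^2) / sum(enroll)

# Proportion at size 30 or more: by class, then by student
prop_classes  <- mean(enroll >= 30)
prop_students <- mean(students >= 30)
```

Large classes count once from the college's point of view but once per enrolled student from the student's point of view, so both the mean and the proportion in large classes go up.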
The distribution of class size from the student's point of view:
densityplot(~students)